Web Scale Taxonomy Cleansing
Authors
Abstract
Large ontologies and taxonomies are automatically harvested from web-scale data. These taxonomies tend to be huge and noisy, and they contain little context. As a result, cleansing and enriching these large-scale taxonomies is a great challenge. A natural way to enrich a taxonomy is to map it to existing datasets that contain rich information. In this paper, we study the problem of matching two web-scale taxonomies. Besides the scale of the problem, we address the challenge that the taxonomies may not contain enough context (such as attribute values). Because existing entity resolution techniques rely directly or indirectly on attribute values as context, we must explore external evidence for entity resolution. Specifically, we explore positive and negative evidence in external data sources such as the web and other taxonomies. To integrate positive and negative evidence, we formulate entity resolution as the problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive experiments and compare our approach with three existing methods to demonstrate its advantage.
Similar Resources
Understanding Tables on the Web
The Web contains a wealth of information, and a key challenge is to make this information machine processable. Because natural language understanding at web scale remains difficult and costly at present, in this paper, we focus our attention on understanding well-structured HTML tables on the Web. From 0.3 billion Web documents, we obtain 1.95 billion tables, and 0.5-1% of these contain meaning...
A Large Scale Taxonomy Mapping Evaluation
Matching hierarchical structures, like taxonomies or web directories, is the premise for enabling interoperability among heterogeneous data organizations. While the number of new matching solutions is increasing, the evaluation issue is still open. This work addresses the problem of comparison for pairwise matching solutions. A methodology is proposed to overcome the issue of scalability. A large...
Text Classification for a Large-Scale Taxonomy Using Dynamically Mixed Local and Global Models for a Node
Hierarchical text classification for a large-scale Web taxonomy is challenging because the number of categories hierarchically organized is large and the training data for deep categories are usually sparse. It’s been shown that a narrow-down approach involving a search of the taxonomical tree is an effective method for the problem. A recent study showed that both local and global information f...
Graph-Based Wrong IsA Relation Detection in a Large-Scale Lexical Taxonomy
Knowledge bases (KBs) play an important role in artificial intelligence. Much effort has been devoted to constructing web-scale knowledge bases, both manually and automatically. Compared with manually constructed KBs, automatically constructed KBs are broader but noisier. In this paper, we study the problem of improving the quality of automatically constructed web-scale knowledge bases, in par...
Music Data Analysis: A State-of-the-art Survey
Music accounts for a significant share of interest among online activities. This is reflected in the wide array of alternatives offered by music-related web and mobile apps and information portals, which feature millions of artists, songs, and events and attract user activity at a similar scale. The availability of large-scale structured and unstructured data has attracted a similar level of attention from data s...
Journal:
- PVLDB
Volume 4, Issue
Pages -
Publication date: 2011